I found some things you should change, but nothing major. There are also some spelling errors that should be fixed. You can use the hunspell package to check your scripts for typos.
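A minimal sketch of such a check, assuming the hunspell package is installed and its default English dictionary is used. The two sample strings are only illustrative comment lines; `hunspell::hunspell()` returns, for each input string, the words it does not recognise.

```r
# Sample comment lines to check; the first one contains a typo.
comments <- c(
  "remove rows if column 1 is idential to column 2",
  "merge files depending on the clustered graphs"
)

if (requireNamespace("hunspell", quietly = TRUE)) {
  # One character vector of unrecognised words per input string.
  typos <- hunspell::hunspell(comments)
  print(typos)
} else {
  message("hunspell is not installed; try install.packages(\"hunspell\")")
}
```

Technical terms and identifiers will usually be reported as well, so the output needs a manual pass.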
@@ -1,13 +1,13 @@
#!/usr/bin/env Rscript
library("optparse")
if (!require(optparse)) install.packages("optparse"); library(optparse)

option_list <- list(
  make_option(opt_str = c("-i", "--input"), default = NULL, help = "Input bed-file. Second last column must be sequences and last column must be the cluster_id.", metavar = "character"),
cluster_id -> cluster id
option_list <- list(
  make_option(opt_str = c("-i", "--input"), default = NULL, help = "Input bed-file. Second last column must be sequences and last column must be the cluster_id.", metavar = "character"),
  make_option(opt_str = c("-p", "--prefix"), default = "" , help = "Prefix for file names. Default = '%default'", metavar = "character"),
  make_option(opt_str = c("-m", "--min_seq"), default = 100, help = "Minimum amount of sequences in clusters. Default = %default", metavar = "integer")
)

opt_parser <- OptionParser(option_list = option_list,
  description = "Convert BED-file to one FASTA-file per cluster")
...cluster.
Author and email are missing.
opt <- parse_args(opt_parser)
The sequences of each cluster are written as a FASTA-file.
if (is.null(bedInput)) {
  stop("ERROR: Input parameter cannot be null! Please specify the input parameter.")
}

bed <- data.table::fread(bedInput, sep = "\t")

# Get last column of data.table, which refers to the cluster, as a vector.
cluster_no <- as.vector(bed[[ncol(bed)]])
You can remove as.vector. Using [[]] already returns a vector.
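A small sketch of the point, using a made-up base R data.frame as a stand-in for `bed` (the column names and values are assumptions for illustration). `[[` behaves the same way on a data.table.

```r
# Toy stand-in for the BED table; column names and values are made up.
bed <- data.frame(
  chr = c("chr1", "chr1", "chr2"),
  sequence = c("ACGT", "TTGA", "GGCC"),
  cluster_id = c(2L, 1L, 2L),
  stringsAsFactors = FALSE
)

# `[[` extracts the column itself, which is already a plain atomic vector.
cluster_no <- bed[[ncol(bed)]]

# Wrapping the result in as.vector() changes nothing.
same <- identical(cluster_no, as.vector(cluster_no))
print(is.vector(cluster_no))
print(same)
```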
# Split data.table bed on its last column (cluster_no) into list of data.frames
clusters <- split(bed, cluster_no, sorted = TRUE, flatten = FALSE)

# For each data.frame(cluster) in list clusters:
discard <- lapply(1:length(clusters), function(i){
It's nicer to use seq_len instead of 1:x.
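The reason is the empty case. A short sketch, assuming `clusters` could end up empty (for example if no cluster passes a filter):

```r
clusters <- list()  # hypothetical empty result

# 1:length(x) counts down from 1 to 0 when x is empty,
# so a loop over it would run twice with invalid indices.
bad_idx <- 1:length(clusters)

# seq_len() and seq_along() return an empty index vector instead.
good_idx <- seq_len(length(clusters))

visited <- 0
for (i in seq_along(clusters)) {
  visited <- visited + 1
}

print(bad_idx)
print(good_idx)
print(visited)
```

`seq_along(clusters)` is shorthand for `seq_len(length(clusters))` and can be passed to `lapply` directly.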
#' @contact rene.wiegandt(at)mpi-bn.mpg.de
merge_similar <- function(tsv_in, file_list, min_weight){

  files <- unlist(as.list(strsplit(file_list, ",")))
The as.list is redundant; strsplit already returns a list.
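A quick check of this, with a hypothetical comma-separated `file_list` (the file names are made up):

```r
file_list <- "cluster_1.fa,cluster_2.fa,cluster_3.fa"  # hypothetical input

# strsplit() returns a list with one character vector per input string.
parts <- strsplit(file_list, ",")
print(is.list(parts))

# So unlist() can be applied directly, without as.list().
files <- unlist(parts)
with_as_list <- unlist(as.list(strsplit(file_list, ",")))
print(identical(files, with_as_list))
```

For a single input string, `strsplit(file_list, ",")[[1]]` gives the same character vector.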
# split the string on the character "." in the first to columns and safe the last value each, to get the number of the cluster.
tsv <- data.table::fread(tsv_in, header = TRUE, sep = "\t",colClasses = 'character')
query_cluster <- unlist(lapply(strsplit(tsv[["Query_ID"]],"\\."), function(l){
Use vapply instead; it already returns a vector.
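A sketch of the `vapply` variant, assuming IDs in which the cluster number is the last "."-separated field (the example IDs are made up):

```r
# Made-up IDs standing in for tsv[["Query_ID"]].
query_ids <- c("motif.cluster.12", "motif.cluster.3")

# vapply() checks that every result is a single string and
# returns a character vector, so no unlist() is needed.
query_cluster <- vapply(
  strsplit(query_ids, ".", fixed = TRUE),
  function(l) utils::tail(l, n = 1),
  FUN.VALUE = character(1)
)
print(query_cluster)
```

Unlike `unlist(lapply(...))`, this fails with an error if any element does not yield exactly one string.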
tail(l,n=1)
}))

# create data.table with only the cluster-numbers
You don't have to convert later if you create your data.table with numeric columns right away.
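One way to do that, sketched with a base R data.frame so it runs without extra packages. The input values are made up, and `data.table::data.table()` accepts the same named-column arguments.

```r
# Cluster numbers as extracted from the IDs (still character at this point).
query_cluster <- c("12", "3", "7")
target_cluster <- c("3", "7", "7")

# Convert while building the table instead of in separate steps afterwards.
sim_not_unique <- data.frame(
  query_cluster = as.numeric(query_cluster),
  target_cluster = as.numeric(target_cluster)
)
str(sim_not_unique)
```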
sim_not_unique[, query_cluster := as.numeric(query_cluster)]
sim_not_unique[, target_cluster := as.numeric(target_cluster)]

# remove rows if column 1 is idential to column 2
*identical
system(paste("cat",f,">",basename(f)))
})
}
# merge FASTA-files depending on the clustered graphs
One or two more comments would be nice in this lapply.
Link #16
Great work!
Improved motif clustering by comparing the motifs of each cluster separately with the merged motif file.
Added a new R-script that labels the TSV-files with the corresponding cluster ID.